
[training] fix: prevent crash on side-threads #4375

Open
tdene wants to merge 1 commit into NVIDIA-NeMo:main from tdene:tde/fix_threading_crash

@tdene tdene commented Jun 15, 2026

Contributor

What does this PR do ?

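#3823 causes a crash when the training state is configured on a side thread, because Python only allows signal handlers to be installed from the main thread of the main interpreter. The traceback below shows the failure: setting `state.cfg` calls `_set_signal_handler()`, which calls `signal.signal()` and raises `ValueError`.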
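
The restriction itself can be reproduced with the standard library alone. The snippet below is a minimal sketch that is independent of Megatron-Bridge; `install_handler` and `errors` are names made up for illustration.

```python
import signal
import threading

errors = []  # collects the error message raised on the side thread


def install_handler():
    # signal.signal() may only be called from the main thread of the
    # main interpreter; anywhere else it raises ValueError.
    try:
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
    except ValueError as exc:
        errors.append(str(exc))


worker = threading.Thread(target=install_handler)
worker.start()
worker.join()

print(errors)
```

Running this records a single `ValueError` message, the same kind of error that ends the traceback in this PR.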

2026-06-15T22:07:04.5888674Z (MegatronPolicyWorker pid=26946) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::lm_policy-0-0:MegatronPolicyWorker.__init__() (pid=26946, ip=172.17.0.2, actor_id=251558e2695df89123e6592e01000000, repr=MegatronPolicyWorker[rank=0])
2026-06-15T22:07:04.5938895Z (MegatronPolicyWorker pid=26946)   File "/root/.local/share/uv/python/cpython-3.13.13-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 449, in result
2026-06-15T22:07:04.5940018Z (MegatronPolicyWorker pid=26946)     return self.__get_result()
2026-06-15T22:07:04.5941248Z (MegatronPolicyWorker pid=26946)            ~~~~~~~~~~~~~~~~~^^
2026-06-15T22:07:04.5942520Z (MegatronPolicyWorker pid=26946)   File "/root/.local/share/uv/python/cpython-3.13.13-linux-x86_64-gnu/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
2026-06-15T22:07:04.5943656Z (MegatronPolicyWorker pid=26946)     raise self._exception
2026-06-15T22:07:04.5944689Z (MegatronPolicyWorker pid=26946)   File "/opt/nemo-rl/nemo_rl/models/policy/workers/megatron_policy_worker.py", line 337, in __init__
2026-06-15T22:07:04.5946225Z (MegatronPolicyWorker pid=26946)     model_and_optimizer_state = setup_model_and_optimizer(
2026-06-15T22:07:04.5946913Z (MegatronPolicyWorker pid=26946)         config,
2026-06-15T22:07:04.5947513Z (MegatronPolicyWorker pid=26946)     ...<2 lines>...
2026-06-15T22:07:04.5948304Z (MegatronPolicyWorker pid=26946)         pre_load_checkpoint_hook=getattr(self, "_pre_load_checkpoint_hook", None),
2026-06-15T22:07:04.5949088Z (MegatronPolicyWorker pid=26946)     )
2026-06-15T22:07:04.5949984Z (MegatronPolicyWorker pid=26946)   File "/opt/nemo-rl/nemo_rl/models/megatron/setup.py", line 1048, in setup_model_and_optimizer
2026-06-15T22:07:04.5950890Z (MegatronPolicyWorker pid=26946)     state.cfg = megatron_cfg
2026-06-15T22:07:04.5951362Z (MegatronPolicyWorker pid=26946)     ^^^^^^^^^
2026-06-15T22:07:04.5952430Z (MegatronPolicyWorker pid=26946)   File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/state.py", line 168, in cfg
2026-06-15T22:07:04.5953315Z (MegatronPolicyWorker pid=26946)     self._set_signal_handler()
2026-06-15T22:07:04.5953790Z (MegatronPolicyWorker pid=26946)     ~~~~~~~~~~~~~~~~~~~~~~~~^^
2026-06-15T22:07:04.5954695Z (MegatronPolicyWorker pid=26946)   File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/state.py", line 466, in _set_signal_handler
2026-06-15T22:07:04.5956046Z (MegatronPolicyWorker pid=26946)     self._signal_handler = DistributedSignalHandler(self.cfg.train.exit_signal).__enter__()
2026-06-15T22:07:04.5956833Z (MegatronPolicyWorker pid=26946)                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
2026-06-15T22:07:04.5957835Z (MegatronPolicyWorker pid=26946)   File "/opt/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/training/utils/sig_utils.py", line 130, in __enter__
2026-06-15T22:07:04.5958707Z (MegatronPolicyWorker pid=26946)     signal.signal(self.sig, handler)
2026-06-15T22:07:04.5959224Z (MegatronPolicyWorker pid=26946)     ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
2026-06-15T22:07:04.5960005Z (MegatronPolicyWorker pid=26946)   File "/root/.local/share/uv/python/cpython-3.13.13-linux-x86_64-gnu/lib/python3.13/signal.py", line 58, in signal
2026-06-15T22:07:04.5960896Z (MegatronPolicyWorker pid=26946)     handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
2026-06-15T22:07:04.5961635Z (MegatronPolicyWorker pid=26946) ValueError: signal only works in main thread of the main interpreter
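
One common way to avoid this kind of crash is to skip handler installation when the caller is not on the main thread. The sketch below illustrates that general pattern only. It is not necessarily the change made in this PR, whose diff is not shown here, and `install_signal_handler_if_main_thread` is a hypothetical helper, not a Megatron-Bridge API.

```python
import signal
import threading


def install_signal_handler_if_main_thread(sig, handler):
    """Install `handler` for `sig` only when called from the main thread.

    Hypothetical helper for illustration. Returns the previously installed
    handler, or None when installation was skipped on a side thread.
    """
    if threading.current_thread() is not threading.main_thread():
        # signal.signal() would raise ValueError here, so skip it.
        return None
    return signal.signal(sig, handler)


results = []  # return values observed on the side thread


def side_thread_task():
    # On a side thread the helper returns None instead of raising.
    results.append(
        install_signal_handler_if_main_thread(signal.SIGTERM, signal.SIG_DFL)
    )


worker = threading.Thread(target=side_thread_task)
worker.start()
worker.join()

print(results)
```

The trade-off is that a process set up this way from a side thread has no exit-signal handler installed, so any code that later queries the handler has to tolerate its absence.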

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
tdene added a commit to tdene/RL that referenced this pull request Jun 15, 2026
Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
@yaoyu-33 yaoyu-33 added area:training Training loop, callbacks, and runtime integration bug Something isn't working community-request needs-more-tests Requires additional L0 and L1 test coverage before merge needs-review PR is ready for code review and waiting on a reviewer labels Jun 16, 2026